ABSTRACT

The abundance of data on the Internet facilitates the improvement
of extraction and processing tools. The trend towards open data
publishing encourages the adoption of structured formats like CSV
and RDF. However, there is still a plethora of unstructured data on
the Web which we assume to contain semantics. For this reason, we
propose an approach to derive semantics from web tables, which are
still the most popular publishing tool on the Web. The paper also
discusses methods and services for unstructured data extraction and
processing as well as machine learning techniques to enhance such a
workflow. The eventual result is a framework to process, publish
and visualize linked open data. The software enables table
extraction from various open data sources in the HTML format and an
automatic export to the RDF format, making the data linked. The
paper also evaluates machine learning techniques in conjunction
with string similarity functions applied to a table recognition
task.

Categories and Subject Descriptors 

D.2 [Software]: Software Engineering; D.2.8 [Software 
Engineering]: Metrics— complexity measures, performance 
measures 

General Terms 

Algorithms 

Keywords 

Machine Learning, Linked Data, Semantic Web 

1. INTRODUCTION 

The Web contains various types of content, e.g. text, pictures,
video, audio as well as tables. Tables are used everywhere on the
Web to represent statistical data, sports results, music data and
arbitrary lists of parameters. Recent research conducted on the
Common Crawl census indicated that an average Web page contains at
least nine tables. In this research about 12 billion tables were
extracted from a billion HTML pages, which demonstrates the
popularity of this type of data representation. Tables are a
natural way for people to interact with structured data and can
provide a comprehensive overview of large amounts of complex
information. The prevailing part of structured information on the
Web is stored in tables. Nevertheless, we argue that the table is
still a neglected content type regarding processing, extraction and
annotation tools.

For example, even though there are billions of tables on the Web,
search engines are still not able to index them in a way that
facilitates data retrieval. The annotation and retrieval of
pictures, video and audio data is meanwhile well supported, whereas
one of the most widespread content types is still not sufficiently
supported. Assuming that a table contains on average 50 facts, it
is possible to extract more than 600 billion facts taking into
account only the 12 billion sample tables found in the Common
Crawl. This is already six times more than the whole Linked Open
Data Cloud. Moreover, despite a shift towards semantic annotation
(e.g. via RDFa) there will always be plain tables abundantly
available on the Web. With this paper we want to turn a spotlight
on the importance of table processing and knowledge extraction from
tables on the Web.

The problem of deriving knowledge from tables embedded 
in an HTML page is a challenging research task. In order 
to enable machines to understand the meaning of data in a 
table we have to solve certain problems: 

1. Search for relevant Web pages to be processed; 

2. Extraction of the information to work with; 

3. Determining relevance of the table; 

4. Revealing the structure of the found information; 

5. Identification of the data range of the table; 

6. Mapping the extracted results to existing vocabularies 
and ontologies. 

The difference between how a human and a machine recognize a simple
table is depicted in Fig. 1. Machines are not easily able to derive
formal knowledge about the content of the table.

<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-s6z2{text-align:center}
</style><table class="tg" style="undefined;table-layout: fixed; width: 208px">
<colgroup><col style="width: 41px"><col style="width: 55px"><col style="width: 51px">
<col style="width: 61px"></colgroup><tr><th class="tg-031e"></th><th class="tg-s6z2">P1</th>
<th class="tg-s6z2">P2</th><th class="tg-s6z2">P3</th></tr><tr><td class="tg-031e">Obj1</td>
<td class="tg-s6z2">a1</td><td class="tg-s6z2">b2</td><td class="tg-s6z2">c3</td></tr><tr>
<td class="tg-031e">Obj2</td><td class="tg-s6z2">d1</td><td class="tg-s6z2">e2</td>
<td class="tg-s6z2">f3</td></tr><tr><td class="tg-031e">Obj3</td><td class="tg-s6z2">g1</td>
<td class="tg-s6z2">h2</td><td class="tg-s6z2">i3</td></tr><tr><td class="tg-031e">Obj4</td>
<td class="tg-s6z2">j1</td><td class="tg-s6z2">k2</td><td class="tg-s6z2">l3</td></tr><tr>
<td class="tg-031e">Obj5</td><td class="tg-s6z2">n1</td><td class="tg-s6z2">op</td>
<td class="tg-s6z2">xyz</td></tr></table>

Figure 1: Different representations of one table.

The paper describes current methodologies and services to tackle
some crucial Web table processing challenges and introduces a new
approach to table data processing which combines advantages of
Semantic Web technologies with robust machine learning algorithms.
Our approach allows machines to distinguish certain types of tables
(genuine tables), recognize their structure (orientation check) and
dynamically link the content with already known sources.

The paper is structured as follows: Section 2 gives an overview of
related studies in the field of unstructured data processing and
presents Web services which provide the user with table extraction
functions. Sections 3 and 4 describe the approach and establish a
mathematical ground for further research. Section 5 presents the
machine learning algorithms and string distance functions used.
Section 6 outlines the implementation and Section 7 showcases the
evaluation of the approach. Finally, we derive conclusions and
share plans for future work.


2. RELATED WORK 

The Linked Open Data concept raises the question of automatic table
processing as one of the most crucial. Open Government Data is
frequently published in simple HTML tables that are not well
structured and lack semantics. Thus, the problems discussed in the
paper concern methods of acquiring datasets related to road repairs
from the government of Saint Petersburg. There is also a raw
approach [10] in information extraction, which is template-based
and effective in processing web sites with unstable markup. The
crawler was used to create a LOD dataset of CEUR Workshop
proceedings.

CEUR Workshop Proceedings. Web: http://ceur-ws.org/

A.C. e Silva et al. in their paper suggest and analyze an algorithm
of table research that consists of five steps: location,
segmentation, functional analysis, structural analysis and
interpretation of the table. The authors provide a comprehensive
overview of the existing approaches and design a method for
extracting data from ASCII tables. However, smart table detection
and distinguishing is not considered.

J. Hu et al. introduced in their paper methods for table detection
and recognition. Table detection is based on the idea of edit
distance while table recognition uses a random graphs approach.
M. Hurst takes ASCII tables into consideration and suggests an
approach to derive an abstract geometric model of a table from a
physical representation. A graph of constraints between cells was
implemented in order to determine the position of cells.
Nevertheless, the results are rather high, which indicates the
efficiency of the approach. The authors of the papers achieved
significant success in structuring a table, but the question of the
table content and its semantics is still open.

D. Embley et al. tried to solve the table processing problem as an
extraction problem with an introduction of machine learning
algorithms. However, the test sample was rather small, which might
have resulted in overfitting.



W. Gatterbauer et al. developed in their paper a new approach
towards information extraction. The emphasis is placed on the
visual representation of the web table as rendered by a web
browser. Thus the problem becomes a question of analysis of visual
features, such as 2-D topology and typography.

Y. A. Tijerino et al. introduced the TANGO approach (Table Analysis
for Generating Ontologies), which is mostly based on WordNet with a
special procedure of ontology generation. The whole algorithm
comprises four actions: table recognition, mini-ontology
generation, inter-ontology mappings discovery and merging of
mini-ontologies. During the table recognition step a search in
WordNet supports the process of table segmentation, so that no
machine learning algorithms were applied.

V. Crescenzi et al. introduced RoadRunner, an algorithm for
automatic extraction of HTML table data. The process is based on
regular expressions and the handling of matches and mismatches.
Eventually a DOM-like tree is built, which is more convenient for
information extraction. However, there are still cases when regular
expressions are of little help in extracting any data.

To sum up, different approaches to information extraction have been
developed over the last ten years. In our work we introduce an
effective extraction and analysis framework built on top of those
methodologies, combining table recognition techniques, machine
learning algorithms and Semantic Web methods.

2.1 Existing Data Processing Services 

Automatic data extraction has always received a lot of attention
from the Web community. There are numerous web services that
provide users with sophisticated instruments useful in web
scraping, web crawling and table processing. Some of them are
presented below.

2.1.1 ScraperWiki

ScraperWiki is a powerful tool based on a subscription model that
is suitable for software engineers and data scientists whose work
is connected with processing large amounts of data. Being a
platform for interaction between business, journalists and
developers, ScraperWiki allows users to solve extracting and
cleaning tasks, helps to visualize acquired information and offers
tools to manage retrieved data. Some of the features of the
service:

• Dataset subscription makes possible the automatic tracking,
update and processing of a specified dataset on the Internet.

• A wide range of data processing instruments, for instance,
information extraction from PDF documents.

ScraperWiki allows one to parse web tables into the CSV format, but
processes all the tables on the page even though they do not
contain relevant data, e.g. layout tables. Also the service does
not provide any Linked Data functionality.

ScraperWiki. Web: https://scraperwiki.com/

2.1.2 Scrapy. 

Scrapy is a fast high-level framework written in Python for web
scraping and data extraction. Scrapy is distributed under the BSD
license and is available on Windows, Linux, MacOS and BSD.
Combining performance, speed, extensibility and simplicity, Scrapy
is a popular solution in the industry. A lot of services are based
on Scrapy, such as ScraperWiki or PriceWiki.

2.1.3 Bitrake. 

Bitrake is a subscription-based tool for scraping and processing
data. Bitrake offers a special service for those who are not
acquainted with programming and provides an API, written in Ruby
and JavaScript, for experienced developers. One of the distinctive
features of the service is a self-developed scraping algorithm with
simple filtering and configuration options. Bitrake is also used in
monitoring and data extraction tasks.

2.1.4 Apache Nutch. 

Nutch is an open source web crawler system based on Apache Lucene
and written in Java. The main advantages of Nutch are performance,
flexibility and scalability. The system has a modular architecture
which allows developers to create custom extensions, e.g.
extraction, transformation or distributed computing extensions. On
the other hand, the tool does not have built-in Linked Data
functionality, which requires additional development.

2.1.5 Import.io.

Import.io is an emerging data processing service. Comprehensive
visualization and the opportunity to use the service without
programming experience help Import.io become one of the most
widespread and user-friendly tools. The system offers users three
methods of extraction arranged by growing complexity: an extractor,
a crawler and a connector. The feature of automatic table
extraction is also implemented but supports only the CSV format.

3. CONCEPT 

In order to achieve automatic table processing, certain problems
have to be solved:

1. HTML tables search and localization from a URL provided by the
user;

2. Computing of appropriate heuristics; 

3. Table genuineness check, in other words, check whether 
a table contains relevant data; 

4. Table orientation check (horizontal or vertical); 

5. Transformation of the table data to an RDF model. 
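
As an illustration of step 5, the following minimal Python sketch serializes a horizontally oriented table as N-Triples. The namespace `http://example.org/`, the helper names and the choice of one resource per row are assumptions made for this example only; the paper does not prescribe a particular vocabulary or serialization.

```python
import re

BASE = "http://example.org/"  # assumed namespace, for illustration only


def slug(text):
    """Turn a header into an IRI-safe token."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text.strip()).strip("_")


def escape(literal):
    """Escape characters that are special inside an N-Triples literal."""
    return (literal.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))


def table_to_ntriples(headers, rows, table_id="table1"):
    """Horizontal table: headers become properties, each row a resource."""
    triples = []
    for index, row in enumerate(rows, start=1):
        subject = f"<{BASE}{table_id}/row{index}>"
        for header, cell in zip(headers, row):
            predicate = f"<{BASE}property/{slug(header)}>"
            triples.append(f'{subject} {predicate} "{escape(cell)}" .')
    return triples


if __name__ == "__main__":
    for line in table_to_ntriples(["Name", "City"], [["Ivanov I.I.", "Berlin"]]):
        print(line)
```

For a vertically oriented table the same function could be applied to the transposed cell matrix, which is why the orientation check has to precede the transformation.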

Correct table recognition affects the performance of most web
services. The growth of data on the Web drives updates of databases
and knowledge bases at a frequency that does not tolerate errors,
inconsistency or ambiguity. With the help of our methodology we aim
to address the challenges of automatic knowledge extraction and
knowledge replenishment.

3.1 Knowledge retrieval 

Knowledge extraction enables the creation of knowledge bases and
ontologies using the content of HTML tables. It is also a major
step towards five-star open data, making the knowledge linked with
other data sources and accessible, in addition, in a
machine-readable format. Thus, correct table processing and
ontology generation are a crucial part of the entire workflow. Our
framework implements learning algorithms which allow automatic
distinguishing between genuine and non-genuine tables, as well as
automatic ontology generation.

We call a table genuine when it contains consistent data (e.g. the
data the user is looking for) and we call a table non-genuine when
it contains any HTML page layout information or rather useless
content, e.g. a list of hyperlinks to other websites within one row
or one column.


3.2 Knowledge acquisition 

Knowledge replenishment raises important questions of data updating
and deduplication. A distinctive feature of our approach is a fully
automatic update from the data source. The proposed system
implements components of the powerful Information Workbench
platform, which introduces the mechanism of Data Providers. Data
Providers observe a data source specified by a user and all its
modifications according to a given schedule. Therefore, the same
knowledge graph can be replenished with new entities and facts,
which, in turn, facilitates data deduplication.

4. FORMAL DEFINITIONS 

The foundation of the formal approach is based on ideas of 
Ermilov et al. 

Definition 1. A table $T = (\mathcal{H}, \mathcal{N})$ is a tuple
consisting of a header $\mathcal{H}$ and data nodes $\mathcal{N}$,
where:

• the header $\mathcal{H} = \{h_1, h_2, \ldots, h_n\}$ is an
n-tuple of header elements $h_i$. We also assume that the set of
headers might be optional, e.g. $\exists T \equiv \mathcal{N}$. If
the set of headers exists, it might be either a row or a column.

• the data nodes

$$\mathcal{N} = \begin{pmatrix}
c_{1,1} & c_{1,2} & \cdots & c_{1,m} \\
c_{2,1} & c_{2,2} & \cdots & c_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
c_{n,1} & c_{n,2} & \cdots & c_{n,m}
\end{pmatrix}$$

are an (n, m) matrix consisting of n rows and m columns.

Five-star open data. Web: http://5stardata.info/

Definition 2. The genuineness of the table is a parameter which is
computed via a function of grid nodes and headers, so that
$G_T(\mathcal{N}, \mathcal{H}) = g_k \in G$, $G = \{true, false\}$.

Definition 3. The orientation of the table is a parameter whose
values are defined in the set $O = \{horizontal, vertical\}$ if and
only if $G_T(\mathcal{N}, \mathcal{H}) = true$, so that
$\exists O_T \iff G_T \equiv true$. The orientation of the table is
computed via a function of grid nodes and headers, so that
$O_T(\mathcal{N}, \mathcal{H}) = o_k \in O$. If the orientation is
horizontal, then the headers are presented as a row, so that
$O_T = horizontal$. If the orientation is vertical, then the
headers are presented as a column, so that $O_T = vertical$.

Table 1: Horizontal orientation

Table 2: Vertical orientation

Definition 4. A set of heuristics $V$ is a set of quantitative
functions of grid nodes $\mathcal{N}$ and headers $\mathcal{H}$
which is used by machine learning algorithms in order to define
genuineness and orientation of the given table $T$.

$$G_T(\mathcal{N}, \mathcal{H}) = \begin{cases}
true, & V_{g_1} \in [v_{g_1 \min}, v_{g_1 \max}], \ldots, V_{g_m} \\
false, & \text{otherwise}
\end{cases} \tag{1}$$

where $V_{g_1}, \ldots, V_{g_m} \in V$ and
$[v_{g_1 \min}, v_{g_1 \max}]$ is the range of values of $V_{g_1}$
necessary for the true value in conjunction with the
$V_{g_2}, \ldots, V_{g_m}$ functions.

$$O_T(\mathcal{N}, \mathcal{H}) = \begin{cases}
horizontal, & V_{h_1} \in [v_{h_1 \min}, v_{h_1 \max}], \ldots, V_{h_n} \\
vertical, & V_{v_1} \in [v_{v_1 \min}, v_{v_1 \max}], \ldots, V_{v_l}
\end{cases} \tag{2}$$

where $V_{h_1}, \ldots, V_{h_n} \in V$,
$V_{v_1}, \ldots, V_{v_l} \in V$, $[v_{h_1 \min}, v_{h_1 \max}]$ is
the range of values of $V_{h_1}$ necessary for the horizontal value
in conjunction with the $V_{h_2}, \ldots, V_{h_n}$ functions, and
$[v_{v_1 \min}, v_{v_1 \max}]$ is the range of values of $V_{v_1}$
necessary for the vertical value in conjunction with the
$V_{v_2}, \ldots, V_{v_l}$ functions.

Thus, it becomes obvious that the heuristics are used in order to
solve the classification problem $X = \mathbb{R}^n$,
$Y = \{-1, +1\}$, where the data sample is
$X^l = (x_i, y_i)_{i=1}^{l}$ and the goal is to find the parameters
$w \in \mathbb{R}^n$, $w_0 \in \mathbb{R}$ so that:

$$a(x, w) = \mathrm{sign}(\langle x, w \rangle - w_0). \tag{3}$$

We describe heuristics and machine learning algorithms in detail in
Section 5. From description logics we define the terminological
component $TBox_T$ as a set of concepts over the set of headers
$\mathcal{H}$. We define the assertion component $ABox_T$ as a set
of facts over the set of grid nodes $\mathcal{N}$.

Hypothesis 1. Tables in unstructured formats contain semantics.

$$\exists T \mid \{\mathcal{H}_T \Leftrightarrow TBox_T,\ \mathcal{N}_T \Leftrightarrow ABox_T\} \tag{4}$$

In other words, there are tables with relevant content, which could
be efficiently extracted as knowledge.

The evaluation of the hypothesis is presented in Section 7.
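
To make the formal definitions more tangible, the sketch below implements the linear decision rule of Eq. (3) and an interval test in the spirit of Eq. (1) in Python. The function names, the treatment of a zero argument in `sign` and all numeric values are assumptions chosen for illustration; they are not taken from the paper.

```python
def sign(value):
    """Return +1 for non-negative values and -1 otherwise (assumed tie rule)."""
    return 1 if value >= 0 else -1


def classify(x, w, w0):
    """Linear decision rule a(x, w) = sign(<x, w> - w0), cf. Eq. (3)."""
    if len(x) != len(w):
        raise ValueError("feature and weight vectors must have equal length")
    dot = sum(xi * wi for xi, wi in zip(x, w))
    return sign(dot - w0)


def within_ranges(values, ranges):
    """True if every heuristic value lies in its [min, max] range, cf. Eq. (1)."""
    return all(low <= value <= high for value, (low, high) in zip(values, ranges))


if __name__ == "__main__":
    # Hypothetical table trace with two heuristic values.
    trace = [0.9, 0.2]
    print(classify(trace, w=[1.0, -1.0], w0=0.5))
    print(within_ranges(trace, [(0.5, 1.0), (0.0, 0.4)]))
```

In this reading the class +1 would correspond to a genuine table and -1 to a non-genuine one; the weights and the threshold are what a learning algorithm has to determine from the training sample.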


4.1 Algorithm Description

Algorithm 1 The workflow of the system
procedure Workflow
    URL ← specified by the user
    tables localization
    n ← found tables
    while n > 0 :
        n ← n − 1
        if genuine = true then
            orientation check
            RDF transformation
        goto while


The user is required to provide a valid URL of the website where
the data is located.

The next step is the search for and localization of HTML tables on
the specified website. One of the essential points in the process
is to handle only DOM <table> leaf nodes in order to avoid nested
tables. Most of the systems described in Section 2.1 present the
user with all the extracted tables, whether they are formatting
tables or relevant tables. In contrast, our approach envisions full
automation with subsequent ontology generation.
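
The restriction to leaf <table> nodes can be sketched with the `html.parser` module of the Python standard library. The class name `LeafTableFinder` and its attributes are hypothetical and only illustrate the idea; the actual plugin is written in Java and may localize tables differently.

```python
from html.parser import HTMLParser


class LeafTableFinder(HTMLParser):
    """Collect the cell text of <table> elements without nested tables."""

    def __init__(self):
        super().__init__()
        self._stack = []       # one [rows, has_nested_table] entry per open table
        self._cell = None      # text fragments of the cell being read, if any
        self.leaf_tables = []  # rows of every leaf table found so far

    def _flush_cell(self):
        # Store the text of the currently open cell in the innermost table.
        if self._cell is not None and self._stack:
            self._stack[-1][0][-1].append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_cell()
            if self._stack:
                self._stack[-1][1] = True  # the enclosing table is not a leaf
            self._stack.append([[], False])
        elif tag == "tr" and self._stack:
            self._flush_cell()
            self._stack[-1][0].append([])
        elif tag in ("td", "th") and self._stack:
            self._flush_cell()
            if not self._stack[-1][0]:     # cell without an explicit <tr>
                self._stack[-1][0].append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._flush_cell()
        elif tag == "table" and self._stack:
            self._flush_cell()
            rows, has_nested = self._stack.pop()
            if not has_nested:
                self.leaf_tables.append(rows)
```

After `finder = LeafTableFinder()` and `finder.feed(html)`, the attribute `finder.leaf_tables` holds only the innermost tables; a layout table that merely wraps another table is discarded.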

The next step is the computation of heuristics for every extracted
table. Using a training set and the heuristics, a machine learning
algorithm classifies the object into a genuine or a non-genuine
group. The input of the machine learning module is a table trace, a
unique multi-dimensional vector of the computed values of the
heuristics of a particular table. Using a training set, the
described classifiers decide whether the vector satisfies the
genuineness class requirements or not. If the vector is classified
as genuine, it is then examined by the classifiers again in an
attempt to determine the orientation of the table.
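
The two-stage control flow described above can be summarized in a few lines of Python. The function `process_tables` and its three callable parameters are hypothetical placeholders for the classifiers and the exporter; the sketch shows only the order of the steps, not how the plugin implements them.

```python
def process_tables(tables, is_genuine, detect_orientation, to_rdf):
    """Run the genuineness check, orientation check and RDF export per table."""
    exported = []
    for table in tables:
        if not is_genuine(table):
            continue  # non-genuine tables are not transformed
        orientation = detect_orientation(table)
        exported.append(to_rdf(table, orientation))
    return exported


if __name__ == "__main__":
    # Toy stand-ins: a table counts as genuine here if it has more than one row.
    tables = [[["menu"]], [["Name", "City"], ["Ivanov I.I.", "Berlin"]]]
    result = process_tables(
        tables,
        is_genuine=lambda t: len(t) > 1,
        detect_orientation=lambda t: "horizontal",
        to_rdf=lambda t, o: (o, len(t)),
    )
    print(result)
```

Passing the classifiers as parameters keeps the sketch independent of the concrete machine learning method chosen for each check.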

The correct orientation determination is essential for the correct
transformation of the table data to semantic formats. It then
becomes possible to separate the data and metadata of a table and
construct an ontology.

If the table is classified as non-genuine, the user receives a
message stating that the particular table is not genuine according
to the efficiency of the chosen machine learning method. However,
the user is allowed to manually mark a table as genuine, which in
turn modifies the machine learning parameters.

5. MACHINE LEARNING METHODS

Machine learning algorithms play a vital role in the system, whose
workflow is based on appropriate heuristics, which in turn are
based on string similarity functions. The nature of the problem
implies the usage of mechanisms that analyze the content of the
table and calculate a set of parameters. In our case the most
suitable option is the implementation of string similarity (or
string distance) functions.

5.1 String Metrics

A string metric is a metric that measures similarity or
dissimilarity between two text strings for approximate string
matching or comparison. In the paper three string distance
functions are used and compared.

Levenshtein distance is calculated as the minimum number of edit
operations (insertions, substitutions, deletions) required to
transform one string into another. Character matches are not
counted.

Jaro-Winkler distance is a string similarity metric and an improved
version of the Jaro distance. It is widely used to search for
duplicates in a text or a database. The numerical value of the
distance lies between 0 and 1, and the closer the value is to 1,
the more similar the two strings are.

An n-gram, popular in the industry, is a sequence of n items
gathered from a text or speech. In the paper n-grams are substrings
of a particular string with the length n.

5.2 Improvements

Due to the mechanism of table extraction and analysis, certain
common practices are improved or omitted. Hence particular changes
in the string similarity functions have been implemented:

• The content type of the cell is more important than the content
itself. Thus it is reasonable to equalize all numbers and count
them as the same symbol. Nevertheless the order of magnitude of a
number is still taken into account. For instance, the developed
system recognizes 3.9 and 8.2 as the same symbols, but 223.1 and
46.5 would be different with a short distance between these
strings.

• Strings longer than three words have a fixed similarity depending
on the string distance function, in spite of the previously
described priority reasons. Moreover, tables often contain fields
like “description” or “details” that might contain a lot of text,
which eventually might cause a mistake during the heuristics
calculations. So the appropriate changes have been implemented in
the algorithms.
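
The string metrics and the two modifications can be illustrated with a short Python sketch. It covers Levenshtein distance and character n-grams; Jaro-Winkler is omitted for brevity. The encoding of a number as one `#` per integer digit and the parameter `long_text_distance` are assumptions: the paper states that numbers are equalized with respect to their order of magnitude and that long strings receive a fixed similarity, but gives neither the encoding nor the value.

```python
import re


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def ngrams(text, n=2):
    """All substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def mask_numbers(text):
    """Replace each number by one '#' per integer digit (assumed encoding)."""
    return _NUMBER.sub(lambda m: "#" * len(m.group().split(".")[0]), text)


def modified_levenshtein(a, b, long_text_distance=0):
    """Levenshtein distance with number masking and a fixed value for long texts."""
    if len(a.split()) > 3 or len(b.split()) > 3:
        return long_text_distance  # placeholder value, not given in the paper
    return levenshtein(mask_numbers(a), mask_numbers(b))
```

With this encoding 3.9 and 8.2 both become `#` and have distance 0, whereas 223.1 and 46.5 become `###` and `##` and differ by a single edit, which matches the behaviour described in the first bullet.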


5.3 Heuristics 

Relying on the theory above it is now possible to construct a set
of primary heuristics. The heuristics are explained using the
following example table:


Table 3: An example table

Name           City     Phone     e-mail
Ivanov I.I.    Berlin   1112233   ivanov@mail.de
Petrov P.P.    Berlin   2223344   petrov@mail.de
Sidorov S.S.   Moscow   3334455   sidorov@ya.ru
Pupkin V.V.    Moscow   4445566   pupkinv@gmail.com


5.3.1 Maximum horizontal cell similarity. 

The attribute indicates the maximum similarity of a particular pair
of cells normalized to all possible pairs in the row found within
the whole table under the assumption of horizontal orientation of a
table. It means the first row of a table is not taken into account
because it contains the header of the table (see Table 1). Having
in mind the example table, the strings “Ivanov I.I.” and
“ivanov@mail.de” are more similar to each other than “Ivanov I.I.”
and “1112233”.

The parameter is calculated according to the following rule:

$$maxSimHor = \max_{i=2,\ldots,n} \frac{\sum_{j_1=1}^{m} \sum_{j_2=1}^{m} dist(c_{i,j_1}, c_{i,j_2})}{m^2} \tag{5}$$

where $i$ is a row, $n$ is the number of rows in a table, $j_1$ and
$j_2$ are columns, $m$ is the number of columns in a table,
$dist()$ is a string similarity function and $c_{i,j}$ is a cell of
a table.


5.3.2 Maximum vertical cell similarity. 

The attribute indicates the maximum similarity of a particular pair
of cells normalized to all possible pairs in the column found
within the whole table under the assumption of vertical orientation
of a table. It means the first column of a table is not taken into
account because in most cases it contains a header (see Table 2).
For the example table the parameter would be rather high because
the table contains pairs of cells with the same content (for
instance “Berlin” and “Berlin”).

Using the same designations the parameter is calculated as:

$$maxSimVert = \max_{j=2,\ldots,m} \frac{\sum_{i_1=1}^{n} \sum_{i_2=1}^{n} dist(c_{i_1,j}, c_{i_2,j})}{n^2} \tag{6}$$

It is obvious that the greater the maximum horizontal similarity,
the greater the chance that a table has vertical orientation.
Indeed, if the distance between values in a row is rather low, it
might mean that those values are instances of a particular
attribute. The hypothesis is also applicable to the maximum
vertical similarity, which indicates a possible horizontal
orientation.
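
The two maximum heuristics can be sketched in Python as follows. The sketch assumes that each row (column) score is the sum of the similarity over all ordered pairs of its cells divided by the squared number of cells, as the phrase "normalized to all possible pairs" suggests. The function `similarity`, built on `difflib.SequenceMatcher`, is a stand-in for `dist()` and is not one of the string metrics used in the paper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for dist(): a value in [0, 1], 1.0 for identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def pairwise_mean(cells, sim=similarity):
    """Sum of sim over all ordered cell pairs divided by len(cells) ** 2."""
    total = sum(sim(a, b) for a in cells for b in cells)
    return total / len(cells) ** 2


def max_sim_hor(table, sim=similarity):
    """Best row score; the first row is skipped as the assumed header."""
    return max(pairwise_mean(row, sim) for row in table[1:])


def max_sim_vert(table, sim=similarity):
    """Best column score; the first column is skipped as the assumed header."""
    columns = list(zip(*table))
    return max(pairwise_mean(column, sim) for column in columns[1:])


if __name__ == "__main__":
    table_3 = [
        ["Name", "City", "Phone", "e-mail"],
        ["Ivanov I.I.", "Berlin", "1112233", "ivanov@mail.de"],
        ["Petrov P.P.", "Berlin", "2223344", "petrov@mail.de"],
        ["Sidorov S.S.", "Moscow", "3334455", "sidorov@ya.ru"],
        ["Pupkin V.V.", "Moscow", "4445566", "pupkinv@gmail.com"],
    ]
    print(max_sim_hor(table_3), max_sim_vert(table_3))
```

Any of the string metrics from Section 5.1 could be passed as the `sim` argument, provided it is scaled to a similarity rather than a distance.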


5.3.3 Average horizontal cell similarity. 

The parameter indicates the average similarity of the content of
rows within the table under the assumption of horizontal
orientation of a table. Again, the first row is not taken into
account. The parameter is calculated according to the following
rule:

$$avgSimHor = \frac{1}{n} \sum_{i=2}^{n} \frac{\sum_{j_1=1}^{m} \sum_{j_2=1}^{m} dist(c_{i,j_1}, c_{i,j_2})}{m^2} \tag{7}$$

where $i$ is a row, $n$ is the number of rows in a table, $j_1$ and
$j_2$ are columns, $m$ is the number of columns in a table,
$dist()$ is a string similarity function and $c_{i,j}$ is a cell of
a table.


The main difference between the maximum and average parameters is
connected with the size of a table. Average parameters give
reliable results during the analysis of large tables, whereas
maximum parameters are applicable in the case of small tables.


5.3.4 Average vertical cell similarity. 

The parameter indicates the average similarity of the content of
columns within the table under the assumption of vertical
orientation of a table. The first column is not taken into account.

$$avgSimVert = \frac{1}{m} \sum_{j=2}^{m} \frac{\sum_{i_1=1}^{n} \sum_{i_2=1}^{n} dist(c_{i_1,j}, c_{i_2,j})}{n^2} \tag{8}$$
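
The average heuristics can be sketched analogously in Python. As an assumption, the row (column) scores are summed from the second row (column) onwards and divided by the total number of rows n (columns m), following the leading factors 1/n and 1/m of the formulas. The function `similarity` is again a stand-in for `dist()` built on `difflib.SequenceMatcher` and is not one of the metrics from the paper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for dist(): a value in [0, 1], 1.0 for identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def pairwise_mean(cells, sim=similarity):
    """Sum of sim over all ordered cell pairs divided by len(cells) ** 2."""
    return sum(sim(a, b) for a in cells for b in cells) / len(cells) ** 2


def avg_sim_hor(table, sim=similarity):
    """Row scores from the second row on, divided by the number of rows n."""
    return sum(pairwise_mean(row, sim) for row in table[1:]) / len(table)


def avg_sim_vert(table, sim=similarity):
    """Column scores from the second column on, divided by the column count m."""
    columns = list(zip(*table))
    return sum(pairwise_mean(col, sim) for col in columns[1:]) / len(columns)


if __name__ == "__main__":
    toy = [["h1", "h2"], ["x", "x"], ["y", "z"]]
    print(avg_sim_hor(toy), avg_sim_vert(toy))
```

Together with the two maximum heuristics these values would form the four-dimensional table trace that is passed to the classifiers.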


5.4 Machine Learning Algorithms 

With the formalization established we are now ready to build
classifiers which use the apparatus of machine learning. Four
machine learning algorithms are considered in the paper:

Naive Bayes classifier is a simple and popular machine learning
algorithm. It is based on Bayes’ theorem with naive assumptions
regarding independence between the parameters presented in a
training set. However, this ideal configuration rarely occurs in
real datasets, so that the result always has a statistical error
[11].

A decision tree is a predictive model that is based on tree-like
graphs or binary trees. Branches represent a conjunction of
features and a leaf represents a particular class. Going down the
tree we eventually end up with a leaf (a class) with its own unique
configuration of features and values.

k-nearest neighbours is a simple classifier based on a distance
between objects [16]. If an object can be represented in Euclidean
space, then there are a number of functions that can measure the
distance between these objects. If the majority of neighbours of
the object belong to one class, then the object is classified into
the same class.

Support Vector Machine is a non-probabilistic binary linear
classifier that tries to divide instances of classes presented in a
training set by a gap as wide as possible. In other words, SVM
builds separating surfaces between categories, which might be
linear or non-linear [19].
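
As an example of how one of these classifiers operates on table traces, the following Python sketch implements a plain k-nearest-neighbours majority vote with Euclidean distance. The training vectors, the labels and the value k = 3 are invented for illustration; the evaluation in the paper relies on the WEKA implementations, not on this code.

```python
import math
from collections import Counter


def knn_classify(trace, training_set, k=3):
    """Label a table trace by majority vote among its k nearest neighbours."""
    neighbours = sorted(
        training_set,
        key=lambda item: math.dist(trace, item[0]),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-dimensional traces with their classes.
    training = [
        ((0.90, 0.80), "genuine"),
        ((0.85, 0.90), "genuine"),
        ((0.80, 0.70), "genuine"),
        ((0.10, 0.20), "non-genuine"),
        ((0.20, 0.10), "non-genuine"),
    ]
    print(knn_classify((0.80, 0.85), training))
    print(knn_classify((0.15, 0.15), training))
```

The same function could serve the orientation check by replacing the labels with horizontal and vertical.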


6. IMPLEMENTATION 

The proposed approach was implemented in Java as a plugin for the
Information Workbench platform developed by fluidOps. The platform
provides numerous helpful APIs responsible for user interaction,
RDF data maintenance, ontology generation, knowledge base adapters
and smart data analysis. The machine learning algorithms are
supplied by WEKA, a comprehensive data mining Java framework
developed by the University of Waikato. The SimMetrics library by
the University of Sheffield (UK) provides the string similarity
functions. The plugin is available in a public repository both as a
deployable artifact for the Information Workbench and as source
code.

Information Workbench. Web: http://www.fluidops.com/en/portfolio/information_workbench/


7. EVALUATION 

The main goal of the evaluation is to assess the performance of the
proposed methodology in comparison with the existing solutions. By
the end of the section we decide whether the hypothesis made in
Section 4 is demonstrated or not. The evaluation consists of two
subgoals: the evaluation of machine learning algorithms with string
similarity functions and the evaluation of the system as a whole.

7.1 Algorithms Evaluation 

A training set of 400 tables taken from the WDC Web Tables corpus
was prepared to test the suggested heuristics and machine learning
methods. In addition, the efficiency of the algorithm modifications
was estimated during the tests. Results are presented in Table 4
and Figs. 2 and 3. In the case of the genuineness check Precision,
Recall and F-Measure were computed.
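
For reference, the three measures can be computed as in the Python sketch below. It uses the standard definitions, with the F-Measure taken as the harmonic mean of Precision and Recall; that the paper uses exactly this balanced variant is an assumption. The confusion counts in the usage lines are hypothetical.

```python
def precision(true_positives, false_positives):
    """Share of tables labelled genuine that really are genuine."""
    denominator = true_positives + false_positives
    return true_positives / denominator if denominator else 0.0


def recall(true_positives, false_negatives):
    """Share of genuine tables that were labelled genuine."""
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator else 0.0


def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 3 correct hits, 1 false alarm, 2 misses.
    p, r = precision(3, 1), recall(3, 2)
    print(p, r, f_measure(p, r))
```

Applied to the Precision of 0.949 and Recall of 0.677 reported for modified kNN with n-grams, `f_measure` gives about 0.79, in agreement with the corresponding row of Table 4.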

Fig. 2 represents the overall fraction of correctly classified
genuine and non-genuine tables w.r.t. the machine learning
algorithms and string similarity functions used. The machine
learning algorithm based on kNN in conjunction with Levenshtein
distance or n-grams demonstrated the highest efficiency during the
genuineness check. A slight increase in efficiency as a result of
the modifications is observed mostly for kNN. It can also be noted
that the overall classification results are generally lower in
comparison with the orientation classification task. This may
indicate a lack of information about the table structure caused by
the small number of heuristics. Development and implementation of
more sophisticated numerical parameters is suggested in order to
improve the classification performance. Hence, the way towards
improving the overall F-Measure is connected with raising the
Recall of the approach.

Fig. 3 indicates the high efficiency of the orientation check task.
Most of the machine learning methods used, except Naive Bayes,
demonstrated results close to 100%. The relatively low result of
Naive Bayes regardless of the chosen string similarity function
might be explained by the number of assumptions on which the method
is based. On the one hand the algorithm has the advantage of
simplicity, and on the other hand it might overlook important
details which affect the classification process because of such
simplicity. During the orientation check only genuine tables are
considered and assessed. Therefore, the eventual result is
Precision.

WEKA. Web: http://www.cs.waikato.ac.nz/ml/weka/

SimMetrics. Web: http://sourceforge.net/projects/simmetrics/

Plugin repository. Web: https://github.com/migalkin/Tables_Provider

WDC - Web Tables. Web: http://webdatacommons.org/webtables/









Table 4: Evaluation of the genuineness check

Classifier     String metric  Variant     Precision  Recall  F-Measure
Naive Bayes    Levenshtein    unmodified  0.925      0.62    0.745
Naive Bayes    Levenshtein    modified    0.93       0.64    0.76
Naive Bayes    Jaro-Winkler   unmodified  0.939      0.613   0.742
Naive Bayes    Jaro-Winkler   modified    0.939      0.617   0.744
Naive Bayes    n-grams        unmodified  0.931      0.633   0.754
Naive Bayes    n-grams        modified    0.937      0.643   0.763
Decision Tree  Levenshtein    unmodified  0.928      0.65    0.765
Decision Tree  Levenshtein    modified    0.942      0.653   0.76
Decision Tree  Jaro-Winkler   unmodified  0.945      0.637   0.761
Decision Tree  Jaro-Winkler   modified    0.946      0.64    0.763
Decision Tree  n-grams        unmodified  0.933      0.603   0.733
Decision Tree  n-grams        modified    0.945      0.637   0.76
kNN            Levenshtein    unmodified  0.904      0.623   0.74
kNN            Levenshtein    modified    0.943      0.667   0.78
kNN            Jaro-Winkler   unmodified  0.928      0.607   0.734
kNN            Jaro-Winkler   modified    0.941      0.64    0.762
kNN            n-grams        unmodified  0.948      0.663   0.78
kNN            n-grams        modified    0.949      0.677   0.79
SVM            Levenshtein    unmodified  0.922      0.597   0.725
SVM            Levenshtein    modified    0.93       0.62    0.744
SVM            Jaro-Winkler   unmodified  0.924      0.61    0.735
SVM            Jaro-Winkler   modified    0.926      0.623   0.745
SVM            n-grams        unmodified  0.922      0.627   0.746
SVM            n-grams        modified    0.927      0.637   0.755
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Figure 2: Genuineness check, correctly classified, % 
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Figure 3: Orientation check, correctly classified, % 



















































































Figure 4: Table recognition results 


Having analyzed the efficiency of the machine learning methods
combined with the string metrics, we decided to apply modified
kNN in conjunction with Levenshtein distance during the
genuineness check and modified SVM in conjunction with
Levenshtein distance during the orientation check.
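
For reference, Levenshtein distance can be computed with the
standard dynamic programming scheme sketched below. This section
does not describe how the distance is turned into classifier
features, so the normalised `similarity` helper is an
illustrative assumption rather than the implementation used in
the system.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: distance scaled to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("Points", "Point"))
```

A similarity close to 1 means that two cell values are nearly
identical strings, which is the kind of signal a classifier such
as kNN could use as a numerical feature.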

7.2 System Evaluation 

The overall performance of the approach is defined as the
product of the highest F-Measure of the genuineness check and
the highest Precision of the orientation check, which results
in 0.77, or 77%. This indicates that we are able to correctly
extract knowledge from at least three out of four arbitrary
tables.
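
The arithmetic behind this estimate can be checked as follows.
The sketch takes the highest F-Measure from Table 4 and the
reported overall value of 0.77. The orientation check Precision
is not listed in a table here, so the value computed below is
only what the two reported numbers imply, and it is approximate
because both inputs are rounded.

```python
def overall_performance(genuineness_f: float, orientation_precision: float) -> float:
    """Overall performance as the product of the two stage scores."""
    return genuineness_f * orientation_precision


best_genuineness_f = 0.79  # highest F-Measure in Table 4 (modified kNN, n-grams)
reported_overall = 0.77    # overall performance stated in the text

# Orientation Precision implied by the two rounded figures above.
implied_orientation_precision = reported_overall / best_genuineness_f

print(f"implied orientation Precision: {implied_orientation_precision:.3f}")
print(f"at least three of four tables: {reported_overall >= 3 / 4}")
```

The implied Precision is consistent with the statement that most
methods reach close to 100% in the orientation check.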

Fig. 4 presents the results of table recognition. All the
tables that are marked as tables in the HTML code of the web
pages are coloured in red and blue. The tables were extracted
from the websites of Associated Press, Sports.ru and the Saint
Petersburg Government Portal. According to the theory, those
tables can be divided into genuine and non-genuine (relevant
and irrelevant) groups. It is easy to see that the tables
coloured in red use the table tag for formatting purposes and
do not contain actual tabular data. In contrast, the tables
coloured in blue are relevant tables whose data can be parsed
and processed. ScraperWiki was able to extract all the red and
blue tables; the user therefore has to choose the relevant
tables for further processing. In contrast to ScraperWiki, the
developed system was able to extract and process only the
blue, genuine tables using appropriate heuristics and machine
learning algorithms.
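
The difference between extracting every table and keeping only
genuine ones can be sketched with the Python standard library.
The `TableCollector` class and the `looks_like_data_table` rule
are hypothetical names introduced for illustration. The rule is
a deliberately naive placeholder and does not reproduce the
heuristics or the trained classifiers of the developed system.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every table element as a list of rows of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []       # finished tables
        self._stack = []       # tables that are currently open
        self._in_cell = False  # simplification: ignores cell nesting

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._stack[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.tables.append(self._stack.pop())
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._stack and self._stack[-1] and self._stack[-1][-1]:
            self._stack[-1][-1][-1] += data.strip()


def looks_like_data_table(rows):
    """Placeholder rule: at least 2 rows and 2 columns, equal row widths."""
    widths = [len(row) for row in rows if row]
    return len(widths) >= 2 and min(widths) >= 2 and len(set(widths)) == 1


html = """
<table><tr><td><img src="logo.png"></td></tr></table>
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>A</td><td>3</td></tr>
</table>
"""
collector = TableCollector()
collector.feed(html)
candidates = [t for t in collector.tables if looks_like_data_table(t)]
print(len(collector.tables), "tables found,", len(candidates), "kept")
```

Collecting all tables corresponds to the behaviour described for
ScraperWiki, while the filtering step stands in for the
genuineness check.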

Associated Press. Web: http://www.aptn.com/

Sports.ru. Web: http://www.sports.ru/

Roads repair dataset | Official Website of Government of
Saint Petersburg. Web: http://gov.spb.ru/gov/otrasl/tr_infr_kom/tekobjekt/tek_rem/


Taking into account the achieved results, we consider the
hypothesis stated earlier in the paper demonstrated. Indeed,
unstructured data contains semantics. Hence, the following
questions are raised. How much semantics does unstructured
data contain? Is there an opportunity to semantically
integrate tables with other types of Web content? Answering
these questions will facilitate the shift from neglecting
tables towards close integration of all Web content.

8. CONCLUSION AND FUTURE WORK 

Automatic extraction and processing of unstructured data is
a fast-evolving topic in science and industry. The suggested
machine learning approach is highly effective in table
structure analysis tasks and provides the tools for knowledge
retrieval and acquisition.

To sum up, a system with the following distinctive features
was developed:

1. Automatic extraction of HTML tables from the sources
specified by a user;

2. Implementation of string metrics and machine learning
algorithms to analyze the genuineness and structure of a
table;

3. Automatic ontology generation and publishing of the
extracted dataset;

4. Use of the Information Workbench API, enabling data
visualization, sharing and linking.
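
As an illustration of the export step, the sketch below
serialises a recognised table as N-Triples using only the Python
standard library. The function names and the base IRI
`http://example.org/table/` are hypothetical. The ontology
generation of the system and the Information Workbench API are
not reproduced here.

```python
from urllib.parse import quote


def escape_literal(text: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def table_to_ntriples(header, rows, base="http://example.org/table/"):
    """One subject per row, one predicate per column, cell values as literals."""
    triples = []
    for index, row in enumerate(rows, start=1):
        subject = f"<{base}row/{index}>"
        for column, value in zip(header, row):
            predicate = f"<{base}column/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


triples = table_to_ntriples(["Team", "Points"], [["A", "3"]])
for line in triples:
    print(line)
```

Each row becomes a resource and each header cell becomes a
property, which is one simple way to make tabular data
addressable as linked data.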

Future work concerns the question of ontology mapping. The
datasets to be extracted might be linked dynamically with
those already existing in the knowledge base during the main
workflow, e.g. by discovering the same entities and relations
in different datasets. This will facilitate the development of
the envisioned Web of Data as well as the wide implementation
of Linked Open Data technologies.
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